Pattern Matching


Characterizing Pattern Matching and Its Limits on Compositional Task Structures

Chang, Hoyeon, Park, Jinho, Cho, Hanseul, Yang, Sohee, Ko, Miyoung, Hwang, Hyeonbin, Won, Seungpil, Lee, Dohaeng, Ahn, Youbin, Seo, Minjoon

arXiv.org Artificial Intelligence

Despite impressive capabilities, LLMs' successes often rely on pattern-matching behaviors, yet these are also linked to OOD generalization failures in compositional tasks. However, behavioral studies commonly employ task setups that allow multiple generalization sources (e.g., algebraic invariances, structural repetition), obscuring a precise, testable account of how well LLMs generalize through pattern matching and where that mechanism breaks down. To address this ambiguity, we first formalize pattern matching as functional equivalence, i.e., identifying pairs of subsequences of inputs that consistently lead to identical results when the rest of the input is held constant. Then, we systematically study how decoder-only Transformer and Mamba models behave in controlled tasks with compositional structures that isolate this mechanism. Our formalism yields predictive and quantitative insights: (1) Instance-wise success of pattern matching is well predicted by the number of contexts witnessing the relevant functional equivalence. (2) We prove a tight sample complexity bound for learning a two-hop structure by identifying the exponent of the data scaling law for perfect in-domain generalization. Our empirical results align with the theoretical prediction, under 20x parameter scaling and across architectures. (3) Path ambiguity is a structural barrier: when a variable influences the output via multiple paths, models fail to form unified intermediate state representations, impairing accuracy and interpretability. (4) Chain-of-Thought reduces data requirements yet does not resolve path ambiguity. Hence, we provide a predictive, falsifiable boundary for pattern matching and a foundational diagnostic for disentangling mixed generalization mechanisms.
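The paper's notion of functional equivalence can be illustrated with a toy sketch (this is not the authors' code; the task `f`, the fragment values, and the contexts are all invented for illustration): two input fragments are treated as equivalent when substituting one for the other never changes the output while the rest of the input is held fixed, and each context where they agree counts as a "witness".

```python
# Hypothetical toy compositional task: the output depends only on (x + y) mod 5,
# so fragments with equal residues mod 5 are functionally equivalent.
def f(fragment, context):
    return (fragment + context) % 5

def witnessing_contexts(f, frag_a, frag_b, contexts):
    """Return the contexts that witness f(frag_a, c) == f(frag_b, c)."""
    return [c for c in contexts if f(frag_a, c) == f(frag_b, c)]

contexts = range(10)
print(len(witnessing_contexts(f, 2, 7, contexts)))  # 10: 2 ≡ 7 (mod 5), every context agrees
print(len(witnessing_contexts(f, 2, 3, contexts)))  # 0: 2 ≢ 3 (mod 5), no context agrees
```

On the paper's account, a model's instance-wise success at pattern matching should track the size of this witness set.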


LLM-ERM: Sample-Efficient Program Learning via LLM-Guided Search

Singhal, Shivam, Malach, Eran, Poggio, Tomaso, Galanti, Tomer

arXiv.org Artificial Intelligence

We seek algorithms for program learning that are both sample-efficient and computationally feasible. Classical results show that targets admitting short program descriptions (e.g., with short "python code") can be learned with a "small" number of examples (scaling with the size of the code) via length-first program enumeration, but the search is exponential in description length. Gradient-based training avoids this cost yet can require exponentially many samples on certain short-program families. To address this gap, we introduce LLM-ERM, a propose-and-verify framework that replaces exhaustive enumeration with an LLM-guided search over candidate programs while retaining ERM-style selection on held-out data. Specifically, we draw $k$ candidates with a pretrained reasoning-augmented LLM, compile and check each on the data, and return the best verified hypothesis, with no feedback, adaptivity, or gradients. Theoretically, we show that coordinate-wise online mini-batch SGD requires many samples to learn certain short programs. Empirically, LLM-ERM solves tasks such as parity variants, pattern matching, and primality testing with as few as 200 samples, while SGD-trained transformers overfit even with 100,000 samples. These results indicate that language-guided program synthesis recovers much of the statistical efficiency of finite-class ERM while remaining computationally tractable, offering a practical route to learning succinct hypotheses beyond the reach of gradient-based training.
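The propose-and-verify loop is simple enough to sketch. In this toy version (an assumption-laden stand-in, not the paper's implementation) the LLM proposer is replaced by a random generator over a tiny program class — parities over subsets of input coordinates — and ERM-style selection keeps the candidate with the fewest errors on the data:

```python
import random

def sample_candidates(k, rng):
    """Stand-in for 'draw k candidate programs from a pretrained LLM'."""
    # each candidate is a subset of input coordinates defining a parity function
    return [frozenset(rng.sample(range(8), rng.randint(1, 3))) for _ in range(k)]

def run_program(coords, x):
    return sum(x[i] for i in coords) % 2

def llm_erm(data, k=200, seed=0):
    """Return the candidate with the fewest errors (ERM-style selection)."""
    rng = random.Random(seed)
    best, best_err = None, float("inf")
    for prog in sample_candidates(k, rng):                     # propose
        err = sum(run_program(prog, x) != y for x, y in data)  # verify on data
        if err < best_err:
            best, best_err = prog, err
    return best, best_err

# target: parity over coordinates {1, 4}
rng = random.Random(1)
data = [(x, (x[1] + x[4]) % 2) for x in
        ([rng.randint(0, 1) for _ in range(8)] for _ in range(200))]
prog, err = llm_erm(data)
# with enough proposals the verified hypothesis often recovers {1, 4},
# though the random proposer gives no guarantee on any single run
```

The key property the paper exploits is that verification needs no gradients or adaptivity; the proposer only has to put non-trivial mass on short correct programs.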


You Have Been LaTeXpOsEd: A Systematic Analysis of Information Leakage in Preprint Archives Using Large Language Models

Dubniczky, Richard A., Borsos, Bertalan, Tihanyi, Norbert

arXiv.org Artificial Intelligence

The widespread use of preprint repositories such as arXiv has accelerated the communication of scientific results but also introduced overlooked security risks. Beyond PDFs, these platforms provide unrestricted access to original source materials, including LaTeX sources, auxiliary code, figures, and embedded comments. In the absence of sanitization, submissions may disclose sensitive information that adversaries can harvest using open-source intelligence. In this work, we present the first large-scale security audit of preprint archives, analyzing more than 1.2 TB of source data from 100,000 arXiv submissions. We introduce LaTeXpOsEd, a four-stage framework that integrates pattern matching, logical filtering, traditional harvesting techniques, and large language models (LLMs) to uncover hidden disclosures within non-referenced files and LaTeX comments. To evaluate LLMs' secret-detection capabilities, we introduce LLMSec-DB, a benchmark on which we tested 25 state-of-the-art models. Our analysis uncovered thousands of PII leaks, GPS-tagged EXIF files, publicly available Google Drive and Dropbox folders, editable private SharePoint links, exposed GitHub and Google credentials, and cloud API keys. We also uncovered confidential author communications, internal disagreements, and conference submission credentials, exposing information that poses serious reputational risks to both researchers and institutions. We urge the research community and repository operators to take immediate action to close these hidden security gaps. To support open science, we release all scripts and methods from this study but withhold sensitive findings that could be misused, in line with ethical principles. The source code and related material are available at the project website https://github.com/LaTeXpOsEd
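The first stage of the pipeline, regex-based pattern matching over LaTeX comments, can be sketched in a few lines. This is an illustrative toy, not the LaTeXpOsEd code: the rules, the sample source, and the helper names are assumptions, and the real framework layers logical filtering, harvesting, and LLM review on top of this step.

```python
import re

# a few well-known credential/link shapes (illustrative subset)
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "github_token":   re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "gdrive_link":    re.compile(r"https://drive\.google\.com/\S+"),
}

def latex_comments(source):
    """Yield the comment part of each line, skipping escaped \\% signs."""
    for line in source.splitlines():
        m = re.search(r"(?<!\\)%(.*)", line)
        if m:
            yield m.group(1)

def scan(source):
    hits = []
    for comment in latex_comments(source):
        for name, pat in SECRET_PATTERNS.items():
            if pat.search(comment):
                hits.append(name)
    return hits

sample = r"""
\section{Results}  % TODO: key AKIA0123456789ABCDEF must be rotated
50\% improvement   % reviewer link: https://drive.google.com/drive/folders/xyz
"""
print(scan(sample))  # ['aws_access_key', 'gdrive_link']
```

Pure regex scanning like this produces many false positives and misses paraphrased secrets, which is the gap the paper's later LLM-based stages are meant to close.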


ML-Based Automata Simplification for Symbolic Accelerators

Yu, Tiffany, Stahle-Smith, Rye, Eswaramoorthi, Darssan, Karakchi, Rasha

arXiv.org Artificial Intelligence

Symbolic accelerators are increasingly used for symbolic data processing in domains such as genomics, NLP, and cybersecurity. However, these accelerators face scalability issues due to excessive memory use and routing complexity, especially when targeting large pattern sets. We present AutoSlim, a machine learning-based graph simplification framework designed to reduce the complexity of symbolic accelerators built on Non-deterministic Finite Automata (NFA) deployed on FPGA-based overlays such as NAPOLY+. AutoSlim uses Random Forest classification to prune low-impact transitions based on edge scores and structural features, significantly reducing automata graph density while preserving semantic correctness. Unlike prior tools, AutoSlim targets automated score-aware simplification with weighted transitions, enabling efficient ranking-based sequence analysis. We evaluated datasets (1K to 64K nodes) on NAPOLY+ and conducted performance measurements including latency, throughput, and resource usage. AutoSlim achieves up to 40 percent reduction in FPGA LUTs and over 30 percent pruning in transitions, while scaling to graphs an order of magnitude larger than existing benchmarks. Our results also demonstrate how hardware interconnection (fanout) heavily influences hardware cost and that AutoSlim's pruning mitigates resource blowup.
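The core operation, score-aware transition pruning, is easy to picture on a toy NFA. In this sketch the learned Random Forest classifier is replaced by a simple score threshold so the example stays self-contained; the transition tuples, scores, and threshold are all invented for illustration.

```python
# NFA transitions as (src, symbol, dst, impact_score), score in [0, 1]
transitions = [
    ("q0", "a", "q1", 0.92),
    ("q1", "b", "q2", 0.85),
    ("q1", "b", "q3", 0.04),   # low-impact branch
    ("q3", "a", "q2", 0.03),   # only reachable via the pruned branch
]

def prune(transitions, keep_if):
    """Drop transitions the (stubbed) classifier marks as low-impact."""
    return [t for t in transitions if keep_if(t)]

# stand-in for the Random Forest decision: keep edges scoring at least 0.10
kept = prune(transitions, keep_if=lambda t: t[3] >= 0.10)
print(len(transitions), len(kept))  # 4 2
```

Fewer transitions means lower graph density, which in turn reduces the LUT and routing (fanout) cost when the automaton is mapped onto the FPGA overlay.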


Pattern Matching in AI Compilers and its Formalization (Extended Version)

Cutler, Joseph W., Collins, Alex, Fan, Bin, Ravishankar, Mahesh, Grover, Vinod

arXiv.org Artificial Intelligence

PyPM is a Python-based domain specific language (DSL) for building rewrite-based optimization passes on machine learning computation graphs. Users define individual optimizations by writing (a) patterns that match subgraphs of a computation graph and (b) corresponding rules which replace a matched subgraph with an optimized kernel. PyPM is distinguished from the many other DSLs for defining rewriting passes by its complex and novel pattern language, which borrows concepts from logic programming. PyPM patterns can be recursive, nondeterministic, and can require checking domain-specific constraints such as the shapes of tensors. The PyPM implementation is thus similarly complicated, consisting of thousands of lines of C++ code. In this paper, we present our work on building PyPM, as well as formalizing and distilling this complexity to an understandable mathematical core. We have developed a formal core calculus expressing the main operations of the PyPM pattern language. We define both a declarative semantics - describing which patterns match which terms - and an algorithmic semantics - an idealized version of the PyPM pattern interpreter - and prove their equivalence. The development is fully mechanized in the Coq proof assistant.
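The match-and-rewrite idea can be shown on a miniature term rewriter (far simpler than PyPM, and not its API: the `Var` class, tuple encoding, and fusion rule below are invented for illustration). Terms are nested tuples, pattern holes are `Var`s, and the rule fuses `add(x, mul(y, z))` into a single `fma(x, y, z)` kernel:

```python
class Var:
    """A pattern hole that binds whatever subterm it is matched against."""
    def __init__(self, name):
        self.name = name

def match(pattern, term, env):
    if isinstance(pattern, Var):
        env[pattern.name] = term
        return True
    if isinstance(pattern, tuple) and isinstance(term, tuple):
        return (len(pattern) == len(term)
                and all(match(p, t, env) for p, t in zip(pattern, term)))
    return pattern == term

def rewrite(term, pattern, build):
    """Apply the rule bottom-up wherever the pattern matches."""
    if isinstance(term, tuple):
        term = tuple(rewrite(t, pattern, build) for t in term)
    env = {}
    return build(env) if match(pattern, term, env) else term

x, y, z = Var("x"), Var("y"), Var("z")
fma_rule = ("add", x, ("mul", y, z))
expr = ("add", "a", ("mul", "b", "c"))
print(rewrite(expr, fma_rule, lambda e: ("fma", e["x"], e["y"], e["z"])))
# ('fma', 'a', 'b', 'c')
```

PyPM's pattern language goes well beyond this: recursion, nondeterminism, and side constraints (e.g., tensor shapes) are exactly what make its declarative and algorithmic semantics non-trivial to relate.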


Robustness Reprogramming for Representation Learning

Hou, Zhichao, Torkamani, MohamadAli, Krim, Hamid, Liu, Xiaorui

arXiv.org Machine Learning

This work tackles an intriguing and fundamental open challenge in representation learning: Given a well-trained deep learning model, can it be reprogrammed to enhance its robustness against adversarial or noisy input perturbations without altering its parameters? To explore this, we revisit the core feature transformation mechanism in representation learning and propose a novel non-linear robust pattern matching technique as a robust alternative. Furthermore, we introduce three model reprogramming paradigms to offer flexible control of robustness under different efficiency requirements. Comprehensive experiments and ablation studies across diverse learning models, ranging from basic linear models and MLPs to shallow and modern deep ConvNets, demonstrate the effectiveness of our approaches. This work not only opens a promising and orthogonal direction for improving adversarial defenses in deep learning beyond existing methods but also provides new insights into designing more resilient AI systems with robust statistics. Deep neural networks (DNNs) have made significant impacts across various domains due to their powerful capability of learning representation from high-dimensional data (LeCun et al., 2015; Goodfellow et al., 2016). However, it has been well-documented that DNNs are highly vulnerable to adversarial attacks (Szegedy, 2013; Biggio et al., 2013).
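To see why swapping the linear matching step for a robust non-linear one helps, consider this toy contrast (a stand-in chosen only for illustration, not the paper's actual estimator): ordinary matching sums elementwise weight-input products, so one corrupted coordinate moves the score arbitrarily, while a median-based aggregation of the same products barely moves.

```python
from statistics import median

def linear_match(w, x):
    """Ordinary linear pattern matching: an inner product."""
    return sum(wi * xi for wi, xi in zip(w, x))

def robust_match(w, x):
    """Median of elementwise products, rescaled to the same units as the sum."""
    return len(w) * median(wi * xi for wi, xi in zip(w, x))

w = [1.0, 1.0, 1.0, 1.0, 1.0]
clean = [1.0, 1.0, 1.0, 1.0, 1.0]
noisy = [1.0, 1.0, 1.0, 1.0, 100.0]   # one adversarially perturbed coordinate

print(linear_match(w, clean), linear_match(w, noisy))   # 5.0 104.0
print(robust_match(w, clean), robust_match(w, noisy))   # 5.0 5.0
```

This is the robust-statistics intuition the abstract appeals to: breakdown-resistant aggregation limits how much any single perturbed input dimension can shift a learned feature response.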


Learning Beyond Pattern Matching? Assaying Mathematical Understanding in LLMs

Guo, Siyuan, Didolkar, Aniket, Ke, Nan Rosemary, Goyal, Anirudh, Huszár, Ferenc, Schölkopf, Bernhard

arXiv.org Artificial Intelligence

We are beginning to see progress in language model assisted scientific discovery. Motivated by the use of LLMs as a general scientific assistant, this paper assesses the domain knowledge of LLMs through their understanding of the different mathematical skills required to solve problems. Understanding can be measured in two ways: the degree to which a model solves problems correctly, and the degree to which it enables fast adaptation to new knowledge. Similarly, "understanding" in an LLM has two facets: on the one hand, pre-trained LLMs possess knowledge that allows remarkable performance in zero-shot tasks; on the other hand, pre-trained LLMs can learn new knowledge, either by leveraging in-context learning or by instruction-tuning from the base parameters as initialization. In particular, we look at not just what the pre-trained model already knows, but how it learns from information during in-context learning or instruction-tuning.


Optimizing the extended Fourier Mellin Transformation Algorithm

Jiang, Wenqing, Li, Chengqian, Cao, Jinyue, Schwertfeger, Sören

arXiv.org Artificial Intelligence

With the increasing application of robots, stable and efficient Visual Odometry (VO) algorithms are becoming more and more important. Based on the Fourier Mellin Transformation (FMT) algorithm, the extended Fourier Mellin Transformation (eFMT) is an image registration approach that can be applied to downward-looking cameras, for example on aerial and underwater vehicles. eFMT extends FMT to multi-depth scenes and thus more application scenarios. It is a visual odometry method that estimates the pose transformation between three overlapping images. On this basis, we develop an optimized eFMT algorithm that improves certain aspects of the method and combines it with back-end optimization for the small loop of three consecutive frames. For this, we investigate the extraction of uncertainty information from the eFMT registration, the related objective function, and the graph-based optimization. Finally, we design a series of experiments to investigate the properties of this approach and compare it with other VO and SLAM (Simultaneous Localization and Mapping) algorithms. The results show the superior accuracy and speed of our o-eFMT approach, which is published as open source.
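The registration primitive underneath FMT-style methods is correlation-based shift estimation. This 1-D toy (an illustration only; the real pipeline works on 2-D spectra in log-polar coordinates to recover rotation and scale as well) finds the circular shift that best aligns two signals:

```python
def estimate_shift(ref, moved):
    """Return the circular shift s maximizing correlation between ref and moved."""
    n = len(ref)
    def corr(s):
        return sum(ref[i] * moved[(i + s) % n] for i in range(n))
    return max(range(n), key=corr)

signal = [0, 1, 4, 9, 4, 1, 0, 0]
shifted = signal[-3:] + signal[:-3]     # circularly shift right by 3
print(estimate_shift(signal, shifted))  # 3
```

In practice the correlation is computed in the frequency domain (phase correlation) for speed and sub-pixel accuracy; the brute-force search here only demonstrates the alignment criterion being maximized.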


Generative AI ChatGPT As Masterful Manipulator Of Humans, Worrying AI Ethics And AI Law

#artificialintelligence

Generative AI such as ChatGPT has been carrying on interactive online conversations meant to manipulate humans, raising serious concerns.

We've all dealt with those manipulative personalities that try to convince us that up is down and aim to gaslight us into the most unsettling of conditions. Their rhetoric can be overtly powerful and overwhelming. You can't decide what to do. Should you merely cave in and hope that the verbal tirade will end? But if you are played into doing something untoward, acquiescing might be quite endangering. Trying to verbally fight back is bound to be ugly and can devolve into even worse circumstances. It can be a no-win situation, that's for sure. The manipulator wants and demands that things go their way. For them, the only win possible is that you completely capitulate to their professed bidding. They will incessantly verbally pound away with their claims of pure logic and try to make it appear as though they are occupying the high ground. You are made to seem inconsequential and incapable. Any number of verbal tactics will be launched at you, over and over again. Repetition and steamrolling are the insidious tools of those maddening manipulators.

Turns out that we not only need to be on the watch for humans that are manipulators, but we now also need to be wary of Artificial Intelligence (AI) that does likewise. AI can be a masterful manipulator of humans. When it comes to AI, there is the hoped-for AI For Good, while in the same breath, we are faced with AI For Bad. I've previously covered in my columns that AI is considered to have a dual-use capacity, see my analysis at the link here. Seems that if we can make AI that can generate amazingly fluent and upbeat essays, the same capacity can be readily switched over to produce tremendously wrongful bouts of fluently overbearing manipulations. This is especially impactful when experienced in an interactive conversational dialogue with the AI. All of this happens via a type of AI known as Generative AI.


Crimes with Python's Pattern Matching • Hillel Wayne

#artificialintelligence

One of my favorite little bits of python is __subclasshook__. Abstract Base Classes with __subclasshook__ can define what counts as a subclass of the ABC, even if the target doesn't know about the ABC. You can do some weird stuff with this. Back in 2019 I used it to create non-monotonic types, where something counts as a NotIterable if it doesn't have the __iter__ method. There wasn't anything too diabolical you could do with this: nothing in Python really interacted with ABCs, limiting the damage you could do with production code.
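The trick the post describes can be reproduced in a few lines. This sketch (assumed from the post's description, not copied from it) defines an ABC whose `__subclasshook__` makes membership structural, and here negated: `NotIterable` counts any class *without* an `__iter__` method as a virtual subclass, which is what makes the type non-monotonic.

```python
from abc import ABC

class NotIterable(ABC):
    @classmethod
    def __subclasshook__(cls, C):
        # membership is decided by the ABSENCE of a method: adding
        # __iter__ to a class removes it from this "type"
        return not hasattr(C, "__iter__")

class Plain:
    pass

print(isinstance(Plain(), NotIterable))  # True: Plain has no __iter__
print(isinstance([], NotIterable))       # False: lists are iterable
```

Note that `abc.ABCMeta` caches `__subclasscheck__` results, so a class that gains `__iter__` after its first check may still report stale membership — one more reason this pattern belongs in the "crimes" category rather than production code.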